Hi Friends,

Even as I launch this today (my 80th birthday), I realize that there is yet so much to say and do. There is just no time to look back, no time to wonder, "Will anyone read these pages?"

With regards,
Hemen Parekh
27 June 2013

Now, as I approach my 90th birthday (27 June 2023), I invite you to visit my Digital Avatar (www.hemenparekh.ai) and continue chatting with me, even when I am no longer here physically.

Friday, 9 January 2004

CDAC PROPOSAL

Confidential $\qquad$ ©C-DAC, Mumbai (formerly NCST)

Review and Recommendations on Gurumine and Gurusearch

Disclaimer: This document is based on the author's understanding of the system, as gathered from the demos and from interaction with Three P Consultants Pvt. Ltd.

A. Background

When a corporate needs to recruit people, it places an advertisement in the paper. The advertisement normally draws many applications, sometimes thousands, and more so when electronic forms are accepted. However, the applications include spurious responses, which need to be filtered out. Given the constraints on time and the huge volume of data, manual filtering is not feasible. Even after spurious responses are removed, many candidates remain to be reviewed and filtered further before they can be called for an interview.

Sometimes candidates do not wait for an advertisement and periodically drop off or email their Resumes to corporates in the hope of being called for an interview whenever a vacancy arises. This introduces a further Resume-management problem: duplicates and updated versions of a candidate's Resume must be identified and handled properly.

All in all recruitment is a costly process in both time and money. To make things worse, the same process is carried out over and over again for the same posts.

Three P Consultants Pvt. Ltd. (3P), a consultant for recruiting people for the engineering and management fields, is attempting to simplify problems such as these by automating the various steps.

In its 13 years of experience, 3P has built a databank of nearly 3 lakh (300,000) Resumes. This experience, together with 3P's anticipation of future trends, drives their website RecruitGuru (http://www.recruitguru.com). The website hosts two main functionalities that any employer would need: Gurusearch and Gurumine.

This document briefly describes these components, for which 3P seeks guidance from C-DAC Mumbai (formerly NCST) to improve their performance.

A corporate registering with 3P is given an account. Whenever a Resume is sent, it is first processed by Gurumine and stored in a database for future consideration. This processing not only identifies the relevant details of a candidate but also evaluates the candidate's strengths and how he fares against other applicants. Note that multiple versions of a candidate's Resume, stored to suit different job profiles, are not to be treated as duplicates; this is partially addressed by having a separate account for each employer and limiting duplicate checking to within an account.

Once the Resumes are stored in the database, Gurusearch is used to retrieve Resumes of candidates based on a number of factors such as the estimated aptitude of the candidate, qualification, age, etc.

B. Gurumine

From a semi-structured Resume, we need to identify a select set of information for comparison, screening, etc. These include experience, educational qualification, etc.

3P's database currently provides for 23 such fields. These are stored in a database. There is no standard format for making Resumes, which makes it challenging to extract information for these 23 fields. The order in which the various fields are filled in a Resume varies from Resume to Resume. In addition, there are several ways of stating the same thing, which complicates the task even further.

A functional profile of the candidate is also prepared to identify the top 3 core areas of the candidate. Gurumine recognises about 33 core areas. While preparing the functional profile, a raw score is assigned to the candidate for each core area. This raw score is used to compute the percentile of the candidate for each core area.

Basic Workflow

When a Resume is submitted to Gurumine, the processing involves

  1. Check for duplicates
  2. Segmentation
  3. Information Extraction
  4. Functional Profiling

Check for duplicates

When a Resume is loaded it is first checked for duplicates based on the first and last names and date of birth. In case of duplicates the HR manager can decide on which copy to keep.
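The duplicate check described above can be sketched as follows. This is a minimal illustration, not 3P's implementation: the `Resume` record, its field names, and the case-insensitive name comparison are all assumptions. The key includes the employer account because duplicate checking is limited to within an account.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Resume:
    # Hypothetical record; the field names are illustrative only.
    account: str          # employer account the Resume was submitted to
    first_name: str
    last_name: str
    date_of_birth: date
    text: str = ""


def duplicate_key(resume):
    """Key on which two Resumes are treated as duplicates of each other."""
    return (
        resume.account,
        resume.first_name.strip().lower(),   # assumption: ignore case and padding
        resume.last_name.strip().lower(),
        resume.date_of_birth,
    )


def find_duplicates(resumes):
    """Group Resumes by key and return only the groups with more than one member."""
    groups = {}
    for resume in resumes:
        groups.setdefault(duplicate_key(resume), []).append(resume)
    return [group for group in groups.values() if len(group) > 1]
```

Each returned group would then be shown to the HR manager, who decides which copy to keep.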

Segmentation

Most Resumes have segments like personal details, work experience, technical skills and education. Segmentation involves identifying these segments and extracting the relevant information within each segment. Though normally an easy task for human beings, this is complex to automate. Candidates use different conventions for separating the segments, and the titles/headers used for the segments are neither standard nor in any standard order.

To do segmentation, 3P has identified keywords typical to the segments. They also use a set of heuristics such as the following to help in this process.

  1. The name of the candidate is always in the largest font.

Note: They are heavily dependent on a particular format here, and this may not always be true. We need clarification from them on what happens when this assumption does not hold.

  2. To identify an email address they search for a string containing the '@' character and terminated by a country identifier like 'in', 'com', 'au', 'uk', ...

Note: A generic identification routine is available with C-DAC Mumbai, which doesn't require the country code for identifying an email address.

  3. They have a synonym list for segment headers. For example, "Relevant Experience" and "Job Experience" mean the same.

Note that this list is not exhaustive.

This approach works to a fairly good degree of accuracy.
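A keyword-driven segmentation of this kind can be sketched as below. The synonym table and the function names are hypothetical, and the rule that text before the first recognised header belongs to the personal-details segment is an assumption made for illustration.

```python
# Hypothetical synonym list: canonical segment name -> header variants.
HEADER_SYNONYMS = {
    "experience": ["experience", "relevant experience", "job experience", "work experience"],
    "education": ["education", "educational qualification"],
    "skills": ["skills", "technical skills"],
    "personal": ["personal details", "personal information"],
}


def canonical_header(line):
    """Return the canonical segment name if the line is a known header, else None."""
    cleaned = line.strip().rstrip(":").strip().lower()
    for segment, variants in HEADER_SYNONYMS.items():
        if cleaned in variants:
            return segment
    return None


def segment_resume(text):
    """Split Resume text into segments keyed by canonical header name."""
    segments = {}
    current = "personal"  # assumption: lines before any header are personal details
    for line in text.splitlines():
        header = canonical_header(line)
        if header is not None:
            current = header
            continue
        if line.strip():
            segments.setdefault(current, []).append(line.strip())
    return segments
```

Because the lookup is an exact match against a finite list, any header missing from the list is silently treated as body text, which is the non-exhaustiveness problem noted above.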

Information Extraction

Once the segments have been identified, the required information is extracted from each of these segments.

Within the personal information section, they identify the name, age, address, email address and phone numbers of the candidate.

The heuristic for identifying the address is that it typically follows the name; otherwise it is clearly marked by a heading such as 'Address'. The extraction is not accurate, and parts of an address are sometimes omitted. The entire routine for address identification, and its nature, is still to be clarified.

Apart from technical skills, the Experience segment is one of the most relevant segments in this problem domain. The current job profile details are identified by looking for keywords like "current period", "to date", etc. However, this technique has a problem if a person has taken a break after his last job.

After the relevant information has been extracted, the information of the candidate is displayed as a form on the left side of the screen. The right side displays the Resume of the candidate from which this information was extracted and hence one can easily verify and update (if required) the contents in the form.

Functional Profiling

The Resume is scanned once again for keywords, irrespective of segment, for functional profiling. For example, a keyword like "marketing vice president" clearly indicates that the candidate is from a marketing background, and hence the weight of this keyword in the marketing profile is very high. Similarly, all keywords have been assigned weights. These weights are calculated from the Resume corpus via statistical methods and are updated every month. The raw score is computed from the weights of all keywords present in the Resume.

While extracting keywords, it has been observed that the experience segment poses problems. For example, there was a Resume wherein a candidate had written "report to vice president" and hence the keyword "vice president" was associated with the candidate!
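The raw-score computation can be illustrated with the sketch below. The weights are invented for the example (3P derives theirs statistically from the Resume corpus), and summing the weights of the keywords present is an assumption about how they are combined.

```python
import re

# Hypothetical weights per core area; real weights come from the Resume corpus.
KEYWORD_WEIGHTS = {
    "marketing": {"sales": 2.0, "brand": 1.5},
    "finance": {"accounts manager": 4.0, "audit": 2.0, "budget": 1.0},
}


def raw_scores(resume_text):
    """Sum the weights of the keywords present, separately for each core area."""
    text = resume_text.lower()
    scores = {}
    for area, weights in KEYWORD_WEIGHTS.items():
        total = 0.0
        for keyword, weight in weights.items():
            # Whole-word match anywhere in the Resume, irrespective of segment.
            if re.search(r"\b" + re.escape(keyword) + r"\b", text):
                total += weight
        scores[area] = total
    return scores


def top_core_areas(scores, n=3):
    """Return the n core areas with the highest raw score."""
    return sorted(scores, key=scores.get, reverse=True)[:n]
```

Matching keywords without regard to context reproduces the weakness described above: a phrase such as "report to vice president" would still count towards the candidate's score.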

The raw score computed for each profile is used to compute the percentile range to give an idea where the candidate stands compared to other applicants. This is displayed using graphs for easy comparison.
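One way to turn raw scores into a percentile range is sketched below. The definition used here (the share of applicants with a strictly lower raw score) and the 10-point band width are assumptions; the actual formula used by Gurumine is not stated.

```python
def percentile_rank(score, all_scores):
    """Percentage of applicants whose raw score is strictly below `score`."""
    if not all_scores:
        raise ValueError("all_scores must not be empty")
    below = sum(1 for s in all_scores if s < score)
    return 100.0 * below / len(all_scores)


def percentile_band(percentile, width=10):
    """Map a percentile to a (lower, upper) range, e.g. 75.0 -> (70, 80)."""
    lower = min(int(percentile // width) * width, 100 - width)
    return (lower, lower + width)
```

For a core area with raw scores 1, 2, 3 and 4 across four applicants, the candidate scoring 3 has two applicants below him and so falls in the 50 to 60 band.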

C. Gurusearch

Once the Resumes have been processed and stored in the database, short-listing candidates for an interview involves specifying the percentile range and other criteria such as age and qualification. Gurusearch fetches the Resumes that match the selection criteria for further perusal. 3P also wants to explore an AI technique such as SOM (self-organising maps) for clustering and ranking the candidates.
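The selection step can be pictured as a simple filter over stored candidate records. The sketch below is illustrative only: the `Candidate` fields and the `search` parameters are hypothetical, and the real system presumably runs the equivalent query against its database.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    # Hypothetical record produced by Gurumine.
    name: str
    age: int
    qualification: str
    percentiles: dict  # core area -> percentile (0 to 100)


def search(candidates, area, min_pct, max_pct=100.0,
           min_age=None, max_age=None, qualifications=None):
    """Return candidates matching all criteria, best percentile first."""
    results = []
    for c in candidates:
        pct = c.percentiles.get(area)
        if pct is None or not (min_pct <= pct <= max_pct):
            continue
        if min_age is not None and c.age < min_age:
            continue
        if max_age is not None and c.age > max_age:
            continue
        if qualifications is not None and c.qualification not in qualifications:
            continue
        results.append(c)
    return sorted(results, key=lambda c: c.percentiles[area], reverse=True)
```

A call such as `search(pool, "marketing", min_pct=70, max_age=40, qualifications={"B.A."})` would return the short-list for a marketing vacancy under those assumed criteria.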

D. Knowledge Bases

The knowledge bases used by the system as of now, include

  1. Keyword list

There are tables of keywords for:

a. Posts like vice president, accounts manager etc

b. Headers for segments like experience, technical skill, etc.

c. Qualification

Note: This list is not exhaustive.

  2. Synonym list

People have various ways of stating one thing. For example, a graduate of arts may write B. A. or Bachelor of Arts. Similarly segment titles may be written as "Experience" or may be broken into "Relevant Experience" and "Other Experience", etc.

The synonym lists identify and map these various forms into one.
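A synonym list of this kind is essentially a many-to-one lookup. The sketch below applies it to an already-extracted qualification field; the table entries and the function name are illustrative assumptions, not 3P's data.

```python
import re

# Hypothetical synonym list: canonical form -> variants seen in Resumes.
QUALIFICATION_SYNONYMS = {
    "B.A.": ["b.a.", "b. a.", "ba", "bachelor of arts"],
    "B.E.": ["b.e.", "b. e.", "be", "bachelor of engineering"],
}


def _normalise(value):
    """Lower-case and collapse runs of whitespace so variants compare equal."""
    return re.sub(r"\s+", " ", value.strip().lower())


# Invert the table once: normalised variant -> canonical form.
_LOOKUP = {
    _normalise(variant): canonical
    for canonical, variants in QUALIFICATION_SYNONYMS.items()
    for variant in variants
}


def canonical_qualification(value):
    """Map a qualification string to its canonical form; unknown values pass through."""
    return _LOOKUP.get(_normalise(value), value.strip())
```

The lookup is applied to the extracted field rather than to free text, since a short variant such as "be" would otherwise collide with ordinary English words.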

Apart from the operational knowledge bases listed above, 3P has knowledge bases that are currently not in use. Many of these are experimental systems designed by Mr. Parekh, Executive Director, Three P Consultants Pvt. Ltd. based on his many years of experience. Some of these are listed below.

  1. A probability chart giving the likelihood of a candidate being appointed to various posts, depending on his current qualifications and experience. For example, a candidate who has been a manager for the last two years has a probability of 0.97 of being appointed as a manager, 0.95 as a senior manager, 0.40 as a regional manager, 0.20 as a vice president and 0.0 as a trainee.
  2. An earlier system (no longer in use) in which a human expert classified candidates into a particular sector based on certain keywords. For example, if the job title of a candidate was "regional sales manager", the keyword "sales" would indicate that he was from the marketing sector. The expert would then indicate that the candidate may be put into the marketing sector, with the explanation that his job title was regional sales manager and that the word "sales" triggered the categorisation.
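The probability chart in item 1 can be pictured as a lookup table. The sketch below holds only the single row quoted above; keying the table on current post and years in that post, and the `threshold` parameter, are assumptions made for illustration.

```python
# One row of the chart, using the figures quoted in the text.
# Key: (current post, years in that post) -> probability of appointment per post.
PLACEMENT_PROBABILITY = {
    ("manager", 2): {
        "manager": 0.97,
        "senior manager": 0.95,
        "regional manager": 0.40,
        "vice president": 0.20,
        "trainee": 0.0,
    },
}


def likely_posts(current_post, years_in_post, threshold=0.5):
    """Posts whose appointment probability meets the threshold, most likely first."""
    row = PLACEMENT_PROBABILITY.get((current_post, years_in_post), {})
    ranked = sorted(row.items(), key=lambda item: item[1], reverse=True)
    return [post for post, probability in ranked if probability >= threshold]
```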

E. Problems with the current system

  1. The detection of duplicate Resumes can be further improved by checking for the recent experience or education (in absence of experience), in addition to name and date of birth.

The scenario where a candidate applies for multiple posts within a company has not been considered. While applying for different posts the candidate may tailor the resume for each post. In this case the Resumes are not to be considered as duplicates as the emphasis on experience or skill may differ.

  2. When a Resume is sent by mail, there may be many email ids in the text, which makes picking the candidate's email id through a generic mail-id identification routine a little tricky. However, if the segmentation module identifies the right segment from which to pick the email id, this is not an issue.
  3. When a phone number was split across two lines, the system was not able to capture it, as the length of each part did not meet the minimum length criterion for a phone number.

Note: A word grouper could be used to combine successive number blocks into one to deal with this problem.

  4. Efficiency of address identification can be improved using a named entity recognition system.

  5. Replacing the various forms in which a given piece of information is expressed (e.g. B.E., BE, Bachelor of Engineering) by a single form (B.E.) will simplify keyword identification in a document.
  6. The period of service is not taken into consideration. Consider a requirement for a DBA with three applicants:
    • Candidate A, who has seven years of experience as a DBA and has been working as a system administrator for the last two years.
    • Candidate B, who has been working as a DBA for the last two years.
    • Candidate C, who worked as a DBA for four years and took a break for one year to pursue further education.

The system currently considers only the last experience in detail. For profile matching, keyword occurrences are used without regard to the place and context of occurrence. This could result in wrong or poor classifications.

  7. The list of segment headers is not standardised and can grow. Segmentation can be improved to some extent using natural language techniques.
  8. The knowledge base acquisition needs to be enhanced to deal with varying styles.
  9. Profiling can be enhanced with the use of an ontology.
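As an illustration of the word grouper suggested for split phone numbers, the sketch below joins successive digit blocks separated only by whitespace, hyphens or line breaks before applying the length check. The minimum of eight digits and the function name are assumptions.

```python
import re

MIN_PHONE_DIGITS = 8  # assumed minimum length for a phone number

# A run of digit blocks separated only by whitespace (including newlines) or hyphens.
_NUMBER_RUN = re.compile(r"\+?\d+(?:[\s\-]+\d+)*")


def group_number_blocks(text):
    """Combine successive number blocks and keep those long enough to be a phone number."""
    candidates = []
    for match in _NUMBER_RUN.finditer(text):
        digits = re.sub(r"\D", "", match.group())  # drop separators and any leading '+'
        if len(digits) >= MIN_PHONE_DIGITS:
            candidates.append(digits)
    return candidates
```

The grouper is deliberately greedy, so an unrelated number on the preceding line (an age or a pin code, say) would be merged into the phone number; restricting it to the personal-details segment would reduce that risk.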

C-DAC Mumbai's role

C-DAC Mumbai can help address all the problems mentioned above. In addition, C-DAC Mumbai would help improve the accuracy of:

  • information extraction
  • profiling

C-DAC Mumbai can also help Three P Consultants Pvt. Ltd. build a solution to rank and/or cluster candidates.

A variety of technologies are available for incorporation into the existing framework. Many of these can be incorporated as incremental refinements. The technologies vary in the extent of improvement they offer, the availability of attempted solutions, the difficulty of configuration/training, etc. Working out the details requires a more elaborate analysis of the current system as well as sample data, and can be taken up once a preliminary MoU is in place.

raju

From: sasi@ncst.ernet.in

Sent: Friday, January 09, 2004 2:50 PM

To: raju

Cc: Kavitha M; sasi@yuga.ncst.ernet.in

Subject: RE: Letter of Interest

→ CDAC Response

Dear Shri Raju Kapoor,

Thanks for the feedback. In order to move towards an MoU, we need some thoughts from you on

  1. areas which you would like to take up in decreasing order of priority
  2. the model of NCST involvement you have in mind.

Based on this we can work on an MoU. We may need one or two rounds of discussions before we freeze the items 1 and 2. But based on the broad description provided in our document, we would like a formal note from you sharing your thoughts on 1 and 2 above, along with any constraints on time and budget you would like to mention. We can then discuss internally, and then have a discussion jointly to finalise the MoU terms.

- Sasi

Quoting Raju Kapoor raju@3pjobs.com:

----- Original message from Raju Kapoor raju@3pjobs.com -----

Date: Wed, 7 Jan 2004 18:42:12 +0530

From: Raju Kapoor raju@3pjobs.com

Reply-To: Raju Kapoor raju@3pjobs.com

To: Kavitha M kavitham@ncst.ernet.in

Hi Kavitha,

I see from the document you sent that you have a fair understanding of RecruitGuru. Some minor aberrations can be explained when we meet next. We would be glad to move to the next stage (MOU). Kindly let me know our input at this stage.

Kind Regards

Raju Kapoor

Principal

3P CONSULTANTS PVT. LTD.

Member of PENRHYN International (www.penrhyn.com)

http://www.penrhyn.com

Member AESC (www.aesc.org http://www.aesc.org)

+91-22-2850 5800 (Office) Ext.15

+91-22-2850 5656 (Direct)

+91-9821111969 (Mobile)

+91-22-2850 6663 (Fax)

www.3pjobs.com http://www.3pjobs.com/

raju

From: sasi@ncst.ernet.in

Sent: Friday, January 09, 2004 3:05 PM

To: raju

Cc: Kavitha M; sasi@yuga.ncst.ernet.in

Subject: Re: Missed out on one point

→ CDAC Response

We will certainly be interested in incorporating machine learning in the system, and we do believe it will add a lot of strength to the system.

However, there are various areas where ML ideas can be incorporated and various tools are likely to be suitable for these.

Some elements can be built in during the initial stage itself, and some of them may need some more time and can be taken as Phase II.

We will proceed with this, as soon as we get a formal acknowledgement of our earlier proposal and a direction to go ahead along with thoughts on time frame, model(s) of cooperation, areas to be taken up, etc.

- Sasi

Quoting Raju Kapoor raju@3pjobs.com:

----- Original message from Raju Kapoor raju@3pjobs.com -----

Date: Wed, 7 Jan 2004 18:54:13 +0530

From: Raju Kapoor raju@3pjobs.com

Reply-To: Raju Kapoor raju@3pjobs.com

Subject: Missed out on one point

To: Kavitha M kavitham@ncst.ernet.in

Hi Kavitha

We were also looking at the possibility of building a feedback loop to make the system self learning. If you think this could be too large an engagement to start with, you may take it as phase II in the agreement.

Regards

Raju Kapoor

Principal

3P CONSULTANTS PVT. LTD.

Member of PENRHYN International (www.penrhyn.com)

http://www.penrhyn.com

Member AESC (www.aesc.org http://www.aesc.org)

+91-22-2850 5800 (Office) Ext.15

+91-22-2850 5656 (Direct)

+91-9821111969 (Mobile)

+91-22-2850 6663 (Fax)

www.3pjobs.com http://www.3pjobs.com/

 

 








