GPT-3 takes the Bar Exam. It achieved human parity on the "Evidence" section and came very close in "Torts" and "Civil Procedure". It did substantially worse in "Constitutional Law", "Real Property", "Contracts", and "Criminal Law".
Not bad for a first attempt, but also not as impressive as GPT-3's other achievements. Part of the reason is that GPT-3 was not specifically trained on legal documents. This is not because the researchers didn't try. They say:
"OpenAI does make some retraining or 'fine-tuning' capabilities available through its API, and these API endpoints do allow for some control of the training process like learning rates or batch sizes. We did attempt to fine tune text-davinci-003 by providing it with 200 unseen, simulated Multistate Bar Examination bar exam questions with correct and incorrect explanations. We provided the training samples both with and without explanatory text from the answer guide. In total, we trained six fine-tuned models, altering training prompts, training responses, batch size, learning rate, and prompt weighting. However, in all cases, the fine-tuned model significantly underperformed text-davinci-003 itself. Due to the scarcity of high-quality data for training and assessment, we did not pursue fine-tuning of GPT models further, and these results possibly confirm large language model fine-tuning risks observed by others." ("text-davinci-003" is the name of the exact instance of GPT-3 that was used through the OpenAI API.)
In order to pass the Bar Exam, a language model has to learn "legalese". Here's what the researchers say about "legalese":
"Legal language is notoriously complex; lawyers and other legal professionals undertake nearly a decade of education and professional training to understand and generate it. Why is this language so 'complex?' Why do so many proficient users of natural languages struggle with contracts and laws, even in their native tongue, to the point that descriptors like 'legalese' or 'lawyer speak' have become common parlance? The answer is likely two-fold. First, for both technical and cultural reasons, the grammar of legal language is significantly different than the grammar of normal language, featuring both highly-stylized customs and pedantically-precise phrasing. The resulting sentence structures are typically much larger and more complex than normal language, as the number of clauses and 'distance' over which clauses are connected exceeds the working memory of both human and non-human readers. Second, by the very nature of common law and precedent, legal language is full of semantic nuance and history. Words like 'security' that have common meaning in normal language often have different, context-specific meanings in legal language. Many words that do not occur at all in normal language, like 'estoppel' or 'indemnitor,' occur regularly in legal corpora. This semantic depth and breadth traditionally required systems that interact with legal text to embed a large amount of domain-specific knowledge."
To put this in perspective, here is their description of what a typical human has to do to achieve the desired level of mastery:
"For most test-takers, the Bar Exam represents the most significant single challenge of their academic careers. In order to be eligible, the typical applicant is required to complete at least seven years of post-secondary education, including a four-year bachelors degree and successful completion of three years of study at an ABA-accredited law school. Following graduation from law school, most applicants also invest substantial amounts of time and money into post-graduation Bar preparation training. This additional preparation is intended to not only solidify one's legal knowledge, but also critically to teach the applicant how to understand and answer the exam's questions."
It should further be noted that GPT-3 was tested only on the multiple-choice portion of the test. The Uniform Bar Examination has three components: (i) a multiple-choice test, (ii) an essay test, and (iii) a scenario-based performance test. GPT-3 achieved human parity (and did not exceed human capability) on only 1 of 7 sections of the multiple-choice portion of the test, which in turn is only 1 of 3 components of the total test.
Here's an example of what the multiple-choice questions look like. The multiple-choice portion of the Bar Exam consists of approximately 200 questions like this one.
Question: A man sued a railroad for personal injuries suffered when his car was struck by a train at an unguarded crossing. A major issue is whether the train sounded its whistle before arriving at the crossing. The railroad has offered the testimony of a resident who has lived near the crossing for 15 years. Although she was not present on the occasion in question, she will testify that, whenever she is home, the train always sounds its whistle before arriving at the crossing.
Is the resident's testimony admissible?
(A) No, due to the resident's lack of personal knowledge regarding the incident in question.
(B) No, because habit evidence is limited to the conduct of persons, not businesses.
(C) Yes, as evidence of a routine practice.
(D) Yes, as a summary of her present sense impressions.
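To make the setup concrete, here is a minimal sketch of how a question like this could be turned into a text prompt and how per-section accuracy could be tallied. This is not the researchers' code: the function names `format_prompt` and `accuracy_by_section`, the prompt layout, and the sample results are all assumptions made up for illustration, and no model is actually called.

```python
from collections import defaultdict


def format_prompt(stem, choices):
    """Render a question stem and its lettered choices as one text prompt.

    The layout is an illustrative assumption, not the paper's prompt.
    """
    lines = [f"Question: {stem}"]
    for letter in sorted(choices):
        lines.append(f"({letter}) {choices[letter]}")
    lines.append("Answer:")
    return "\n".join(lines)


def accuracy_by_section(results):
    """Compute the fraction of correct answers for each exam section.

    `results` is an iterable of (section, predicted_letter, correct_letter).
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for section, predicted, key in results:
        totals[section] += 1
        if predicted.strip().upper() == key.strip().upper():
            correct[section] += 1
    return {section: correct[section] / totals[section] for section in totals}


# Hypothetical graded answers, invented purely to exercise the function.
sample_results = [
    ("Evidence", "C", "C"),
    ("Evidence", "A", "B"),
    ("Torts", "D", "D"),
]

print(format_prompt(
    "Is the resident's testimony admissible?",
    {
        "A": "No, due to the resident's lack of personal knowledge regarding the incident in question.",
        "B": "No, because habit evidence is limited to the conduct of persons, not businesses.",
        "C": "Yes, as evidence of a routine practice.",
        "D": "Yes, as a summary of her present sense impressions.",
    },
))
print(accuracy_by_section(sample_results))
```

Comparing a model to human test-takers then amounts to comparing each section's accuracy against the average human score for that section.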
GPT Takes the Bar Exam
#solidstatelife #ai #nlp #openai #gpt #legalese