Commit 1ab43de ("update summary", 1 parent 5e1d1fc)

1 file changed: README.md (38 additions & 6 deletions)
Code formatting is not particularly exciting but many researchers would consider …
Either technique is painful and finicky.

This repository is a step towards what we hope will be a universal code formatter that uses machine learning to look for patterns in a corpus and to format code using those patterns.

## Introduction

When looking at code, programmers can easily pick out formatting patterns for various constructs, such as how `if` statements and array initializers are laid out. Rule-based formatting systems let us specify these input-to-output patterns explicitly. The key idea behind our approach is to mimic what programmers do as they enter or format code. No matter how complicated the formatting structure is for a particular input phrase, formatting always boils down to the following four canonical operations:

1. *nl*: Inject newline
2. *ws*: Inject whitespace
3. *align*: Align current token with some previous token
4. *indent*: Indent current token from some previous token

The first operation determines the other three: injecting a newline triggers an alignment or indentation, while not injecting a newline triggers the injection of zero or more spaces.

The basic formatting engine works as follows: at each token in the input sentence, decide which of the canonical operations to perform, then emit the current token. Repeat until all tokens have been emitted.
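A minimal sketch of this loop in Python. The `decide` callback is a hypothetical stand-in for the learned model, and `toy_decide` is a hard-coded toy policy; neither reflects CodeBuff's actual API:

```python
# Sketch of the core formatting loop: for each token, ask the model
# which canonical operation to apply, then emit the token.
def format_tokens(tokens, decide):
    out = []
    for i, tok in enumerate(tokens):
        op, arg = decide(tokens, i)  # op in {"nl", "ws", "align", "indent", None}
        if op == "nl":
            out.append("\n")                  # newline, no padding
        elif op in ("align", "indent"):
            out.append("\n" + " " * arg)      # newline, then pad to column `arg`
        elif op == "ws":
            out.append(" " * arg)             # inject `arg` spaces on the same line
        out.append(tok)
    return "".join(out)

# Toy policy: newline before '}', otherwise a single space between tokens.
def toy_decide(tokens, i):
    if i == 0:
        return (None, 0)
    if tokens[i] == "}":
        return ("nl", 0)
    return ("ws", 1)

print(format_tokens(["if", "(", "x", ")", "{", "stmt", ";", "}"], toy_decide))
```

In the real system the decision at each token comes from the trained model rather than a fixed rule.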

To make this approach work, we need a model that maps context information about the current token to one or more canonical operations in {*nl*, *ws*, *align*, *indent*}. To create a formatter for a given language *L*, `CodeBuff` takes as input:

1. A grammar for *L*
2. A set of input files written in *L*
3. A file written in *L* but not in the corpus that you would like to format

`CodeBuff` trains a *k-Nearest-Neighbor* (*kNN*) machine learning model on the corpus. The *kNN* model is particularly attractive because it is powerful yet simple, and it mirrors how programmers format code: they scan their memory for similar contexts and apply the same rule, or the rule they apply most often.
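The kNN idea can be sketched as follows. The feature tuples, the distance function, and the toy corpus here are all illustrative assumptions, not CodeBuff's actual representation:

```python
# Sketch of kNN classification: find the k training contexts most similar
# to the query context and take a majority vote on the formatting operation.
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (feature_tuple, operation) pairs."""
    # Hamming-style distance: count of mismatched categorical features.
    dist = lambda a, b: sum(x != y for x, y in zip(a, b))
    nearest = sorted(train, key=lambda ex: dist(ex[0], query))[:k]
    # Majority vote among the k nearest neighbors.
    return Counter(op for _, op in nearest).most_common(1)[0][0]

# Toy corpus: (prev-token-type, cur-token-type, parent-rule) -> operation
corpus = [
    (("{", "stmt", "block"), "indent"),
    (("{", "stmt", "block"), "indent"),
    ((";", "stmt", "block"), "align"),
    (("id", "=", "assign"), "ws"),
]
print(knn_predict(corpus, ("{", "stmt", "block")))
```

The vote resolves conflicting precedents in the corpus the same way a programmer would fall back on their most common habit.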

## Mechanism

### Features

1. INDEX_PREV_TYPE
1. INDEX_PREV_EARLIEST_RIGHT_ANCESTOR
1. INDEX_CUR_TYPE
1. INDEX_MATCHING_TOKEN_DIFF_LINE
1. INDEX_FIRST_ON_LINE
1. INDEX_EARLIEST_LEFT_ANCESTOR
1. INDEX_ANCESTORS_CHILD_INDEX
1. INDEX_ANCESTORS_PARENT_RULE
1. INDEX_ANCESTORS_PARENT_CHILD_INDEX
1. INDEX_ANCESTORS_PARENT2_RULE
1. INDEX_ANCESTORS_PARENT2_CHILD_INDEX
1. INDEX_ANCESTORS_PARENT3_RULE
1. INDEX_ANCESTORS_PARENT3_CHILD_INDEX
1. INDEX_ANCESTORS_PARENT4_RULE
1. INDEX_ANCESTORS_PARENT4_CHILD_INDEX
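Purely as an illustration of the general shape — the real feature extraction lives in CodeBuff's source — a token's context can be packed into a fixed-order tuple whose slots mirror a subset of the names above, ready for the kNN distance comparison. Every key and value here is hypothetical:

```python
# Illustrative only: map a token's context (a dict of hypothetical facts)
# into a fixed-order feature tuple, one slot per feature.
def extract_features(ctx):
    return (
        ctx["prev_type"],         # INDEX_PREV_TYPE
        ctx["cur_type"],          # INDEX_CUR_TYPE
        ctx["first_on_line"],     # INDEX_FIRST_ON_LINE
        ctx["child_index"],       # INDEX_ANCESTORS_CHILD_INDEX
        ctx["parent_rule"],       # INDEX_ANCESTORS_PARENT_RULE
    )

fv = extract_features({
    "prev_type": "{",
    "cur_type": "ID",
    "first_on_line": True,
    "child_index": 0,
    "parent_rule": "block",
})
print(fv)
```

Because every slot is categorical, an equality-based distance over such tuples is enough for the nearest-neighbor lookup sketched earlier.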
