README.md: 38 additions & 6 deletions
@@ -9,13 +9,45 @@ Code formatting is not particularly exciting but many researchers would consider

 Either technique is painful and finicky.

-This repository is a step towards what we hope will be a universal code formatter that looks for patterns in a corpus and attempts to format code using those patterns.
+This repository is a step towards what we hope will be a universal code formatter that uses machine learning to look for patterns in a corpus and to format code using those patterns.

-## Mechanism
+## Introduction

-For a given language *L*, the input to CodeBuff is:
+When looking at code, programmers can easily pick out formatting patterns for various constructs, such as how `if` statements and array initializers are laid out. Rule-based formatting systems allow us to specify these input-to-output patterns. The key idea of our approach is to mimic what programmers do while entering or formatting code. No matter how complicated the formatting structure is for a particular input phrase, formatting always boils down to the following four canonical operations:

-1. a grammar for *L*
-2. a set of input files written in *L*
-3. a file written in *L* but not in the corpus that you would like to format
+1. *nl*: Inject newline
+2. *ws*: Inject whitespace
+3. *align*: Align current token with some previous token
+4. *indent*: Indent current token from some previous token

+The first operation determines which of the other three apply: injecting a newline triggers an alignment or indentation, whereas not injecting a newline triggers injection of zero or more spaces.
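The way the newline decision gates the other operations can be sketched in Python. This is only an illustration under stated assumptions, not CodeBuff's implementation: the `Op` enum and the `follow_up_ops` function are hypothetical names introduced here.

```python
from enum import Enum
from typing import FrozenSet


class Op(Enum):
    """The four canonical operations named in the README."""
    NL = "nl"          # inject newline
    WS = "ws"          # inject whitespace
    ALIGN = "align"    # align current token with some previous token
    INDENT = "indent"  # indent current token from some previous token


def follow_up_ops(inject_newline: bool) -> FrozenSet[Op]:
    """Return the operations still possible once the newline decision is made.

    Hypothetical helper: a newline leads to alignment or indentation,
    while no newline leads to injecting zero or more spaces.
    """
    if inject_newline:
        return frozenset({Op.ALIGN, Op.INDENT})
    return frozenset({Op.WS})
```

Calling `follow_up_ops(True)` yields the alignment and indentation operations, while `follow_up_ops(False)` yields only whitespace injection.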
+
+The basic formatting engine works as follows. At each token in an input sentence, decide which of the canonical operations to perform, then emit the current token. Repeat until all tokens have been emitted.
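The loop described above might look roughly like the following Python sketch. It rests on several assumptions that are not in the README: each decision is reduced to a pair (number of newlines, number of spaces), so alignment and indentation collapse into the space count after a newline, and `toy_decide` is a hand-written stand-in for the learned model.

```python
from typing import Callable, List, Tuple

# Hypothetical decision type: (newlines, spaces) to emit before a token.
Decision = Tuple[int, int]


def format_tokens(tokens: List[str],
                  decide: Callable[[List[str], int], Decision]) -> str:
    """Visit each token, apply the decided whitespace, then emit the token."""
    pieces = []
    for i, token in enumerate(tokens):
        newlines, spaces = decide(tokens, i)
        pieces.append("\n" * newlines + " " * spaces + token)
    return "".join(pieces)


def toy_decide(tokens: List[str], i: int) -> Decision:
    """Hand-written stand-in for a model: newline after ';', no space before ';'."""
    if i == 0:
        return (0, 0)   # nothing before the first token
    if tokens[i - 1] == ";":
        return (1, 0)   # start a new line after a statement terminator
    if tokens[i] == ";":
        return (0, 0)   # attach ';' directly to the preceding token
    return (0, 1)       # otherwise separate tokens with one space


formatted = format_tokens(["x", "=", "1", ";", "y", "=", "2", ";"], toy_decide)
```

Here `formatted` holds two lines, `x = 1;` and `y = 2;`. Swapping `toy_decide` for a trained predictor would leave the loop itself unchanged.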
+
+To make this approach work, we need a model that maps context information about the current token to one or more canonical operations in {*nl*, *ws*, *align*, *indent*}. To create a formatter for a given language *L*, `CodeBuff` takes as input:
+
+1. A grammar for *L*
+2. A set of input files written in *L*
+3. A file written in *L* but not in the corpus that you would like to format
+
+`CodeBuff` trains a *k-Nearest-Neighbor* (*kNN*) machine learning model on the corpus. The *kNN* model is particularly attractive because it is powerful yet simple and mirrors how programmers format code: they scan their memory for similar contexts and apply the same rule, or the rule they apply most often.
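To illustrate the "most common rule among similar contexts" idea, here is a minimal kNN sketch in Python. It is not CodeBuff's model: the feature vectors (previous token type, current token type), the Hamming distance, and the tiny corpus are all assumptions made up for this example.

```python
from collections import Counter
from typing import List, Sequence, Tuple

# Hypothetical training example: (context features, operation label).
Example = Tuple[Sequence[str], str]


def hamming(a: Sequence[str], b: Sequence[str]) -> int:
    """Count the positions at which two feature vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(corpus: List[Example], query: Sequence[str], k: int = 3) -> str:
    """Return the most common label among the k nearest corpus examples."""
    nearest = sorted(corpus, key=lambda ex: hamming(ex[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up corpus: features are (previous token, current token).
corpus = [
    ((";", "id"), "nl"),
    ((";", "if"), "nl"),
    (("=", "id"), "ws"),
    (("id", "="), "ws"),
    (("{", "id"), "nl"),
]

after_semicolon = knn_predict(corpus, (";", "return"), k=3)
around_assign = knn_predict(corpus, ("id", "="), k=1)
```

With this corpus, a token following `;` is predicted to get a newline, and an `=` following an identifier is predicted to get whitespace. A real feature vector would presumably carry far more context than two token types.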